These two maps show the CalEnviroScreen 4.0 asthma and PM2.5 indicators for California from 2021. The asthma indicator is the rate of asthma-related emergency department visits per 10,000 people, while the PM2.5 indicator is the annual mean concentration of PM2.5, which refers to airborne particles that are 2.5 micrometers or less in diameter. PM2.5 concentrations are highest in Central California, around Fresno, Hanford, Visalia, and Bakersfield. Asthma rates also appear highest in Central California, as well as in the Los Angeles area.

The best-fit line on the graph points to a positive association between the PM2.5 concentration and the asthma rate in an area: the higher the PM2.5 concentration, the higher the rate of asthma emergency department visits tends to be. The fit could be better, though. A least-squares line uses every point, so it does not actually discard the points in the middle as outliers, but many points lie far from the line, which means the line describes the data only loosely.

## 
## Call:
## lm(formula = Asthma ~ PM2.5, data = ces4_map)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.424 -21.485  -6.539  13.432 193.479 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  34.4917     1.6229   21.25   <2e-16 ***
## PM2.5         1.7228     0.1564   11.02   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.34 on 8022 degrees of freedom
##   (11 observations deleted due to missingness)
## Multiple R-squared:  0.01491,    Adjusted R-squared:  0.01479 
## F-statistic: 121.4 on 1 and 8022 DF,  p-value: < 2.2e-16

The PM2.5 coefficient is statistically significant: its p-value, like that of the intercept, is very small (<2e-16). An increase of 1 in x (PM2.5) is associated with an increase of 1.7228 in y (asthma emergency department visits per 10,000). However, only 1.491% of the variation in y (asthma) is explained by the variation in x (PM2.5), so the relationship is statistically significant but weak.
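To make the coefficients concrete, the fitted line from the summary above can be written out and evaluated at one point. The PM2.5 value of 10 used here is purely illustrative and is not taken from the data.

$$
\widehat{\text{Asthma}} = 34.4917 + 1.7228 \times \text{PM2.5}
$$

$$
\text{PM2.5} = 10: \quad \widehat{\text{Asthma}} = 34.4917 + 1.7228 \times 10 = 51.7197
$$

So a hypothetical area with a PM2.5 value of 10 would have a predicted asthma rate of about 51.7 visits per 10,000, and each additional unit of PM2.5 adds 1.7228 to that prediction.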

The residuals violate the normality assumption of linear regression: instead of being normally distributed, they are right-skewed with a long positive tail (they range from -50.4 to 193.5, with a median of -6.5).

After a logarithmic transformation of asthma, the best-fit line appears steeper and there are fewer outliers. The line passes closer to more of the data points and, overall, fits better than the line without the logarithmic transformation.

## 
## Call:
## lm(formula = log(Asthma) ~ PM2.5, data = ces4_map)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4046 -0.3767  0.0252  0.3826  1.7603 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.34395    0.03062  109.20   <2e-16 ***
## PM2.5        0.04387    0.00295   14.87   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5725 on 8022 degrees of freedom
##   (11 observations deleted due to missingness)
## Multiple R-squared:  0.02682,    Adjusted R-squared:  0.0267 
## F-statistic: 221.1 on 1 and 8022 DF,  p-value: < 2.2e-16

After refitting the linear regression with log-transformed asthma as the response, the PM2.5 coefficient remains statistically significant. The p-values of the intercept and the PM2.5 coefficient are again reported as <2e-16, the threshold below which R stops printing exact values, so they are still very small (though not necessarily identical to those of the first model). An increase of 1 in x (PM2.5) is associated with an increase of 0.04387 in log(asthma), which corresponds to roughly a 4.5% increase in asthma. Also, 2.682% of the variation in log(asthma) is explained by the variation in x (PM2.5).
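Because the response is on a log scale, the slope is easier to read after exponentiating it. The short derivation below uses only the coefficients printed in the summary above; "predicted asthma" here means the exponentiated prediction of the log model.

$$
\widehat{\log(\text{Asthma})} = 3.34395 + 0.04387 \times \text{PM2.5}
$$

$$
\frac{\exp\bigl(3.34395 + 0.04387\,(x + 1)\bigr)}{\exp\bigl(3.34395 + 0.04387\,x\bigr)} = e^{0.04387} \approx 1.0448
$$

In other words, each one-unit increase in PM2.5 multiplies the predicted asthma rate by about 1.0448, an increase of roughly 4.5%, regardless of the starting PM2.5 value.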

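After plotting the residuals of the log(y) model, they are much closer to normally distributed: they are roughly symmetric around zero (median 0.0252, quartiles -0.3767 and 0.3826) and the long positive tail is gone, so the normality assumption is far more reasonable for this model.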
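The residual comparison described above could be reproduced along the following lines. This is only a sketch: `sim` is a synthetic stand-in for `ces4_map` (simulated values, not CalEnviroScreen data), the simulation parameters are assumptions chosen to give a right-skewed response, and `skewness` is a hypothetical helper defined here rather than a function from any package.

```r
# Synthetic stand-in for ces4_map: simulated values, NOT CalEnviroScreen data.
set.seed(1)
n <- 2000
sim <- data.frame(PM2.5 = runif(n, min = 3, max = 17))
# Asthma is generated with multiplicative (log-normal) noise, an assumption
# chosen so that the raw-scale residuals have a long positive tail.
sim$Asthma <- exp(3.3 + 0.04 * sim$PM2.5 + rnorm(n, sd = 0.57))

# The same two model formulas as in the report, fitted to the synthetic data.
fit_raw <- lm(Asthma ~ PM2.5, data = sim)
fit_log <- lm(log(Asthma) ~ PM2.5, data = sim)

# Sample skewness: near 0 for symmetric residuals, positive for a right tail.
skewness <- function(r) mean((r - mean(r))^3) / sd(r)^3
skew_raw <- skewness(residuals(fit_raw))
skew_log <- skewness(residuals(fit_log))
print(c(raw = skew_raw, log = skew_log))

# Side-by-side histograms of the two residual distributions.
par(mfrow = c(1, 2))
hist(residuals(fit_raw), breaks = 40, main = "Asthma ~ PM2.5", xlab = "Residual")
hist(residuals(fit_log), breaks = 40, main = "log(Asthma) ~ PM2.5", xlab = "Residual")
```

With the real data, replacing `sim` with `ces4_map` should give the same kind of comparison: a clearly positive skewness for the untransformed model and a value much closer to zero for the log model. A normal Q-Q plot (`qqnorm` followed by `qqline`) is another common way to check the same assumption.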